Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Baked data for icu_plurals, icu_timezone #3462

Merged
merged 19 commits into from
Jun 7, 2023

Conversation

robertbastian
Copy link
Member

@robertbastian robertbastian commented May 31, 2023

#2945

This adds data to subcrates (i.e. <component>/data). Users can provide custom baked data by setting the ICU4X_DATA_DIR environment variable during compilation. Crates will then fail to compile if the data doesn't contain the required keys.

The baked data uses --locales modern. I think that's a good default, if clients want other sets of languages they will have to use datagen.

While some of the generated files are big, the compressed tarballs for crates.io all fit comfortably within the 10MB limit.

64K     target/package/icu_calendar-1.2.0.crate
32K     target/package/icu_casemapping-0.7.2.crate
1.7M    target/package/icu_collator-1.2.0.crate
36K     target/package/icu_compactdecimal-0.2.0.crate
2.8M    target/package/icu_datetime-1.2.1.crate
20K     target/package/icu_decimal-1.2.0.crate
1.8M    target/package/icu_displaynames-0.10.0.crate
44K     target/package/icu_list-1.2.0.crate
60K     target/package/icu_locid_transform-1.2.1.crate
96K     target/package/icu_normalizer-1.2.0.crate
44K     target/package/icu_plurals-1.2.0.crate
316K    target/package/icu_properties-1.2.0.crate
392K    target/package/icu_relativetime-0.1.1.crate
7.4M    target/package/icu_segmenter-1.2.1.crate
20K     target/package/icu_timezone-1.2.0.crate

On the API side, I propose adding try_new[_with_locale[_with_options]] constructors. These provider-less constructors become the main documentation entry point, with constructors being listed in the order *, _with_any_provider, *_with_buffer_provider, *_unstable. The provider-constructors refer to the simple constructor's documentation. Example.

In locale-agnostic components (calendar, normalizer, locid_transform, properties, timezone, casemapping) we can create infallible constructors, which can usually be const (as these don't take arguments, they could also be const values 1). Example.

Footnotes

  1. My thinking here is to use const fn for constructors, and const values for what would be free functions, like in icu_properties.

@robertbastian robertbastian force-pushed the fullbake branch 2 times, most recently from 90dd7c8 to 1250ef8 Compare May 31, 2023 14:25
@jira-pull-request-webhook

This comment was marked as spam.

@jira-pull-request-webhook

This comment was marked as spam.

@jira-pull-request-webhook

This comment was marked as spam.

@jira-pull-request-webhook

This comment was marked as spam.

@jira-pull-request-webhook

This comment was marked as spam.

@jira-pull-request-webhook

This comment was marked as spam.

@jira-pull-request-webhook

This comment was marked as spam.

@jira-pull-request-webhook

This comment was marked as spam.

@jira-pull-request-webhook

This comment was marked as spam.

@jira-pull-request-webhook

This comment was marked as spam.

@jira-pull-request-webhook

This comment was marked as spam.

@jira-pull-request-webhook

This comment was marked as spam.

@jira-pull-request-webhook

This comment was marked as spam.

@jira-pull-request-webhook

This comment was marked as spam.

@robertbastian robertbastian requested a review from Manishearth June 5, 2023 13:00
Manishearth
Manishearth previously approved these changes Jun 5, 2023
This reverts commit 961b6c1.
Copy link
Member

@sffc sffc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good work; need to align on open questions

// called LICENSE at the top level of the ICU4X source tree
// (online at: https://github.com/unicode-org/icu4x/blob/main/LICENSE ).

fn main() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two separate discussions in this thread.

Custom globaldata path: In #2992 I proposed the dual features approach, but I like the cfg better. A disadvantage of the dual features is that you have the --all-features problem.

Macros in crates: Unless my assumptions are wrong, I see it as a non-starter to call macros in crates they don't belong to. It makes the crates heavier and much more difficult for DCE. There are a couple solution we discussed previously:

  1. The crates export a DataProvider implementing the bounds they need, and client crates fork between them manually (fork-by-key DataProvider works if you manually list out every key you need to fork on).
  2. The globaldata constructors call other globaldata constructors directly. This creates two code paths, one for globaldata and one for custom data, which isn't great, but most constructors are thin so it's probably fine.

Comment on lines -143 to +156
/// let pr = PluralRules::try_new_unstable(
/// &icu_testdata::unstable(),
/// let pr = PluralRules::try_new(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussion: I'm not sure yet what we want to put as the primary example in the docs.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm more and more of the opinion that anything with a provider should be considered a power-user feature. Google and Mozilla need it, but most users will not.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not great for sample code to require enabling a feature flag.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And I assume you're opposed to making it a default feature?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we come to an agreement that the data lives inside the component crates, then it should probably be a default feature

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does that matter? We can also pull in another crate by default.

components/plurals/data.json Outdated Show resolved Hide resolved
Copy link
Member

@sffc sffc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what you want me to re-review

@@ -0,0 +1,28 @@
# This file is part of ICU4X. For terms of use, please see the file
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh I see, you implemented separate crates but distributed them inside the individual components instead of in a centralized location

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's please not block on where in the repo these will live

@@ -53,6 +55,7 @@ serde = ["dep:serde", "zerovec/serde", "icu_locid/serde", "icu_provider/serde"]
datagen = ["serde", "zerovec/databake", "dep:databake"]
experimental = []
bench = ["serde"]
data = ["dep:icu_plurals_data"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As noted elsewhere, we need to bikeshed the feature name (not a landing blocker)

# (online at: https://github.com/unicode-org/icu4x/blob/main/LICENSE ).
[package]
name = "icu_plurals_data"
version = "43.0.0"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: how about starting at "0.1.0" and incrementing to "0.1.1", "0.1.2", ... for CLDR updates and new data structs and "0.2.0" when there is a breaking data struct change. Then it's clearly a different versioning scheme than the rest of ICU4X.

(Maybe put new data structs also in a new channel, like breaking changes, so that minor releases are reserved for CLDR and we can continue to update old channels if desired)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we bikeshed versioning later?

components/plurals/data/build.rs Outdated Show resolved Hide resolved
},
"locales": {
"CldrSet": [
"modern"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to discuss the CldrSet to use (doesn't need to block this PR)

tools/make/data.toml Outdated Show resolved Hide resolved
@robertbastian robertbastian requested a review from sffc June 7, 2023 16:32
@robertbastian robertbastian changed the title Global data concept Global data for icu_plurals, icu_timezone Jun 7, 2023
@robertbastian robertbastian changed the title Global data for icu_plurals, icu_timezone Baked data for icu_plurals, icu_timezone Jun 7, 2023
Copy link
Member

@sffc sffc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing my nitpicks 😃 This is good to land now and we can continue bikeshedding the details

@robertbastian robertbastian merged commit 10aa1ce into unicode-org:main Jun 7, 2023
@robertbastian robertbastian deleted the fullbake branch June 7, 2023 20:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants